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_j . Abstract 

, We describe a new "coupling from the past" algorithm for variable 

length Markov chains and random systems with perfect connections. This 
' algorithm generalizes Propp and Wilson's simulation scheme, and im- 

£SJ . proves on existing algorithms in that it requires much less restrictive con- 

ditions on the kernel to converge. It proves interesting even in the sim- 
ple case of finite order Markov chains in terms of speed of convergence. 
Chains of variable or infinite order are an old objet of consideration that 
Qh ' raised considerable interest recently because of its applications in applied 

(-H _ probability, from information theory to bio-informatics and linguistics. 

^ , 

C$ , Keywords: Perfect simulation, Context trees, Markov Chains of infinite 

order, Coupling from the past 

1 Introduction 

Since the seminal paper of [19j , perfect simulation schemes for stationary Markov 
chains have been devcloppcd and applied in several fields of applied probabilities, 



ly*-) ■ from statistical physics to Bayesian statistics (see for example [TB] and references 



therein, or [12j for a gentle introduction), 
i In 2002, Comets, Ferandez and Ferrari [5] have proposed an extension to 

processes with long memory: they provided a perfect simulation algorithm for 
stationary processes called random systems with complete connections [T71 Q2] 
or chains of infinite order |13j ; these processes are characterized by a transition 
kernel which specifies, given an infinite sequence of past symbols, the probability 
distribution of the next symbol. The idea was, after jlH [15j [2], to exploite 
. regenerative structures of these processes. Their algorithm relies on renewal 

properties, under a summablc memory decay condition. As a by-product, the 
authors obtained the existence of stationary process for a given kernel, together 
with unicity properties under suitable hypotheses. 

However, it appeared that these conditions on the kernel were quite restric- 
tive, and actually not necessary. Foss [5] and Gallo [5] showed that different cou- 
pling schemes could be designed under alternative assumptions that do not even 
require the kernel to be continuous. Actually, the coupling scheme of Comets, 
Ferandez and Ferrari strongly relies on regeneration, contrary to Propp and 
Wilson's algorithm: thus, it cannot compete with the latter on the simple case 
of Markov chains in terms of convergence time. Recently, De Santis and Pic- 
cioni tried to reconciliate the two algorithms, by providing an hybrid methods 
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that works with two regimes: coalescence for short memory, and regeneration 
on long scales. 

In this paper, we fill the gap between long and short scales, by providing 
a relatively elaborated and yet elegant coupling procedure that purely relies 
on coalescence. It generalizes Propp and Wilson's algorithm, in the sense that 
it is exactly similar for (first order) Markov chains. But it permits to han- 
dle arbitrary continuous infinite memory stationary processes, and even some 
non-continuous ones under particular conditions. The idea is to exploit local 
coalescence, instead of global loss of memory properties. We show that this per- 
fect simulation scheme can be proved to converge under much less restrictives 
hypotheses than was known previously. Even when different schemes can be 
applied, it is converges (possibly very significantly) faster. As they prove very 
useful in many applications (e.g. information Theory |201 121] or bio-informatics 
[U), we set a special focus on perfect simulation of finite, but large order Markov 
chains (or for Variable order Markov Chains, see [3]): our algorithm compares 
very favorably with Propp and Wilson's algorithm on the extended chain in 
terms of computational complexity. 

The paper is organized as follows: Section [2] presents the required notation 
and definitions, together with a few first properties. Section [3] contains the 
conceptual description of the perfect simulation schemes. It relies on a coupling 
function that is described in great detail in Section [4j Section [5] contains the 
precise description of the algorithm. Then, Section[S]gathers elements of analysis 
of the complexity of the algorithm, while Section [7] illustrates the weakness of 
the assumptions required for the algorithm to converge. 

2 Notation and background 

2.1 Spaces and topologies 

Following [5], we denote by G a finite alphabet. For k £ N, we denote by G~ k 
the set of all sequences (w-k, ■ • ■ , i«-i), and G* = Uk>oG~ k . By convention, 
e denotes the empty sequence, and G° = {e}. The set G~ N is the space of 
histories, and w_oo : _i £ G~ N will be denote by w. The history w such that 
for all k < 0,Wk = will be denoted 0. For w £ G~ k , we note \w\ = k and for 
w £ G~ N , |w| = oo. For every non-positive integer n, we define the projection 
IP : G- N * -> G n be defined by Tl n (w) = w n: _i. 
Endowed with the ultrametric distance 

3(w,z) = 2 su ^ k<0:Wk * Zk \ 

G _N is a complete and compact set. A (closed and open) ball B C G _N is a set 
{zs : z £ G _N } for some s £ G*: we denote s = 7Z(B), and denote B = T(s). 
Observe that the radius of T(s) is 2~I S L Note that T(e) = G~ N . A mapping 
/ defined on G~ N is called piecewise constant if the exists a family {sjjjgN of 
elements of G _N such that / is constant on each ball T{sj). 

For — oo < a < b < +oo and w £ G z , the sequence (w a , . . . , u>&) is denoted 
by Wa-.b, and for two sequences w a -.b and z c: d, — oo < a < b < oo, — oo < 
c < d < oo, w a: bz c -d denotes the concatenation of w a :b and z c -d- Wa-.bZc-.d = 
(w a , ■ ■ ■ , Wb, z c , . . . , Zd). Remark: this notation is different from the convention 
taken in [5]. If a > b, w a: b is the empty sequence e. 
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In the sequel, the set of probability distributions on G is denoted by Mi(G), 
and is endowed with the total variation (or L 1 ) distance 

\p - q\rv = x ~ q ( a ^\ = 1 ~ X] p ( a ) A 9 ( a ) ' 

aeG aeG 

where x Ay denotes the minimum of x and y. 

2.2 Probability transition kernels 

A mapping P : G~ N — > A4\(G) is called a probability transition kernel, and 
we denote the image of w £ G _N by P(-\w). We say that P is continuous if 
it is continuous from (G _N ,6) to (Mi(G), \ ■ \tv)- For s G G*, we define the 
oscillation of P on the ball 7~(s) as: 

= sup{|P(.|«0 - P(-\z)\ TV :w,z£ T(s)} . 
Lemma 1. The following assertions hold: 

(i) P is continuous if and only if Vw £ G _N ,r](w-k:-i) — > as k goes to 
infinity. 

(ii) P is continuous if and only if swp{rj(s) : s E G~ k } — > as k goes to 
infinity. 

(Hi) P is uniformly continuous if and only it is continuous. 

The two first assertions are trivial, and (iii) is a consequence of Heine's 
theorem on the compact set G~ N . 

We say that a (nonnecessarily stationary) process (X t ) tg z with distribution 
v on G z is compatible with kernel P if the latter is a version of the one-sided 
conditional probabilities of the former, that is: 

v (X, = g\X i+j = Wj , j 6 -N*) = P(g\w) 

for all i £ Z, g £ G and i^-almost every w. A classical but key remark is that 
St = (• • • , Xt-i, Xt), t £ Z, is a Markov Chain on the compact ultrametric state 
space G _N , with transition kernel Q given by: 

Vw,ze G~ N , Q(w\z) = P(w-i\z)l Wi _ 1=Zi:i<0 . 

2.3 Coupling functions 

A mapping <f) '■ [0, l[xG _N — s- G is called a coupling of P if, given a random 
variable U uniformly distributed on [0,1[, (j)(U,w) has distribution P(-\w) for 
all w £ G _N * . 

In the sequel, we focus on couplings <f> of P such that: 

Vs G G*,0 < u < 1 - |G|?/(s) •) is constant on T(s) . (1) 

The construction of such a coupling is detailed in Section |4l 

Lemma 2. If P is continuous and if <f> is a coupling of P satisfying ([T]), then 
for all u £ [0, 1[ the mapping w 4>{u,w) is continuous, i.e, piecewise constant. 

Proof: let u £ [0, 1[; the uniform continuity of P implies that there exists 
e such that if S(w,z) < e, then \P(-\w) - P(-\z)\ < (1 - u)/\G\. But then 
Equation ([!]) implies that = 4>(u,z). 



3 



3 Description of the Perfect Simulation Scheme 



Given a continuous transition kernel P, two questions arise: 

1. does there exist a stationary distribution v compatible with PI In that 
case, is it unique? 

2. if v exists, how can we sample finite trajectories from that distribution? 

More or less recently, [SJE1IH] have contributed to answer these questions. Their 
approach is to show that there exists a simulation scheme drawing samples of 
and this algorithm is based on the idea of coupling from the past. Following these 
authors, we address these questions by constructing a new perfect simulation 
scheme that converges faster, and more generally than its predecessors. We 
describe the general principle- of this algorithm here. 

Let n be a negative integer; in order to draw (X n , . . . , X_i) from a stationary 
distribution compatible with P, we use a semi-infinite sequence of i.i.d. random 
variables (Ut)t<o 011 the probabilistic space (f2, A, P), each uniformly distributed 
on the interval [0,1[. For each t < 0, let f t : G~ N * -> G" N * be defined by 
ft (ml) = ML4>(Ut,w)] beware the index shift: if z = ft(w) then Z—i = 4>(Ut,w) 
and Zi = u?t+i for i < — 1. 

For all integers t < u, let F t:u = f u -i ° ■ ■ ■ ° ft, and let F t = Ft.-i- For 
any negative integer n, let also H™ u = II" o F t:u and let H™ = H"_ 1 . As will 
be detailed below, for a continuous kernel Proposition [2] implies that H™ u is a 
piecewise constant mapping. Define 

r(n) = sup{i < n : H" is constant} , 

where by convention r(n) = — oo if for all t < 0, i?" is not constant. The 
sequence (r(n))„ is a non-increasing sequence of stopping times with respect to 
the filtration (J 7 S ) S , where T s = a(Ut : t > s). In conditions on P are 

given that ensure almost-sure finiteness for r(n) (and bounded expectation). 
On the other hand, the authors prove that almost-sure finiteness for r(n) is a 
sufficient condition to prove the existence and unicity of a stationary distribution 
v compatible with P (see [5], Theorem 4.3 and corollaries 4.12 and 4.13). As a 
by-product, one obtains a simulation algorithm for sample paths of v: if U = 
(Ut)tez is a sequence of independent, uniformly distributed random variables, 
one can define $ : [0, l] z ->• G 1 such that $([/)t = 4>{U t -i, *(t/)-oo:t-i) for all 
t, and 

i/ = P(*(Ef) G ■), 

the law of &(U), is stationary and compatible with P. 

But the conditions on P required in [5] are quite restrictive: with the nota- 
tion of Section 2] below, it is assumed that 

oo m 

e n A k < °° 

m>0 fc=0 

This condition require in particular that = X) s gG m ^ {P(d\s) '■ 2. € G _N } > 
0. The authors prove that, under these contions, the process regenerates, and 
that the stopping time 

t'(ti) = supji < 11 : H\ is constant} 
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is almost-surely finite, using a Kalikow-type decomposition of the kernel P as a 
mixture of Markov chains of all orders. . This is obviously sufficient to ensure 
that r(n) is finite, but this not necessary by far. Consider the example of order 
1 Markov chains: while Propp and Wilson [TO] have shown that r(n) is almost 
surely finite for every mixing chain (and, besides, that r(n) has the same order 
of magnitude as the mixing time of the chain), r'(n) is easily seen to be almost 
surely infinite if the Dobrushin coefficient of the chain is 0. The contribution 
of this paper is precisely to fill the gap, by providing for general continuous 
kernels a Propp- Wilson procedure that may converge even if the process is not 
regenerating. In fact, for Markov chains of order 1, <p{u,w) depends only on 
w_i and the algorithm presented in this paper is simply Propp and Wilson's 
exact sampling procedure. 

Since the publication of [5], [HI have generalized these results, showing 
that they are not necessary, by proposing other particular conditions covering 
different cases. Gallo [3] shows that P needs not be continuous to ensure the 
existence of v, nor to ensure the finiteness of t(u): he gives an example of 
a non-continuous regenerating chain, see also Example 17.21 below. In [7J, De 
Santis and Piccioni propose another algorithm wich mixes the ideas of [5] and 
|19| : they propose an hybrid simulation scheme working with a Markov regime 
and a long-memory regime. Our approach is different and more general: we 
describe a single procedure that generalizes the sampling schemes of [5] and [19] 
in a single, unified framework. 



4 Construction of the coupling function 

Provide G with any order, so that G _N can be equiped with the lexicographic 
order : w < z if there exists k e — N such that Vj > k, Wj = Zj and wu < Zk- 
In this section and in the following, we consider a fixed continuous kernel P. 
The continuity of P is locally characterized by some coupling factors that we 
define here together with quantities that will prove useful for the construction 
of the coupling function (j>: for all j e G, let A_i(e) = a-i(g\e) = and for 
ken, we G*, let 

a k (g\w-k-.-i) = m£{P(g\z) : z e T(w- k -.-i)} , 

A k {W- k :-l) = 2J a k(g\W-k:-l) , 

g&G 



Ctk{g\W- k ;-l) = Ak-\{w -k+l:-l) + X] { a k(h\W-k:-l) - 0,k-l(h\W-k+U-l)} i 

h<g 

/3 k (g\w-k:-i) = Ak-i(w-k+i-.-i) + X! {afc(^l w -fc : -i) - ak-i(h\w-k+u-i)} ■ 

h<g 

Note that, with our conventions, ao(g\e) = inf{P(g\z_) : z G G _N }. Moreover, if 
s, s' e G* are such that s >z s', then for all g e G it holds that a,k(g\s) > ak(g\s'), 
Ak{s) > Ah(s') and the sequence (A^ ) is increasing. 
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Proposition 1. The coupling factors of the kernel P satisfy the following in- 
equalities: for all s G G* , 

1 - \G\ V (s) < A lsl (s) < 1 - V (s) . (2) 

Proof. For the upper-bound, observe that 

n(s) = sup {\P(-\w) - P(-\z)\ TV :w,ze T(s)} 

= sup I 1 - p ( a \w) A P(a\z) :w,zeT(s)\ 

{ a£G ) 

= 1 - inf \ ^ P{a\w) A P{a\z) :w,ze T(s) I 
UeG J 

< 1- ^inf{P(a|u;)AP(a|z) :w,z€T(s)} 



aEG 



= 1 - ^ inf {P(a|w) : w € T(s)} 



= 1 - A, S |( S ) . 

For the lower-bound, observe that for all z e T(s), as |-P(-|0s) — -P('U)| T y < 
rj(s), it holds that P(a\z) > P(a\0s) — 7/(s) for all a e G and thus 



A| s| ( S ) = ^inf{P(a|z):zeT( S )} 

a(EG 

> J2 p («io s ) - »?(*) 



l-\G\ V (s) . 



□ 



There is a minor improvement for this result: one can show that 
A \>\ ( s ) = E inf ™ z ) ; ^ e 7»} > 1 - (|G| - l)r,(s) . 

This bound cannot be improved in general. 

Proposition 2. The three assertions are equivalent: 

(i) the kernel P is continuous; 

(ii) \/w € G~ N , Ak(ui-k:-i) —> 1 as k —> oo; 

(Hi) — > 1 as k goes to infinity. 

Proof: The equivalence between (i) and (ii) is a simple consequence of 
Lemma [T] and Proposition [TJ Similarly, (iii) follows from (i) : if P is continuous 
on the compact set G _N , then it is uniformly continuous, and 

f(k) = sup r/(s) 

s€G- k 



G 



as k goes to infinity. But by Proposition [TJ Al > 1 — \G\f(k). Finally, (iii) 
implies (ii). 

The equivalence of (ii) and (iii) can also be proved as a consequence of 

Dini's first theorem: defining A k (w) = A k (w— k: —i), the sequence (^kj IS arL 

increasing sequence of continuous functions simply converging to the (continu- 
ous) constant function f , thus the convergence is uniform. 

Proposition 3. For every w £ G _N , 

[0,1[= □ [a k (g\w- k ^ 1 ),/3 k (g\w- k: - 1 )[. 
3 eG,fceN 

In other words: let u £ [0, 1[. For all w £ G _N , there exists a unique k £ N 
and a unique g £ G such that u £ [a k (g\w- k: -i), /3 k (g\w- k: -i)[. 

Proof: Let m (resp. M) be the minimal (resp. maximal) element of G. 
Then a k {m\w- k .-i) = /3 k {M\w- k :-i) = A k (w- k: -i), and 

[A k - 1 (w-i e +i:-i),A k (w-k:-i)[= |_J [a k (g\w- k ;- 1 ),p k (g\w- k: -x)[ . 

geG 

The result follows from the continuity assumption: A_i(s) = and A k (w— k: -i) 
1 as k goes to infinity. Figure Q] illustrates Proposition [3] on a three-symbols al- 
phabet. 




Figure 1: Graphical representation of a coupling <fi on alphabet {0,1,2}: for 
each w- k ;-i, the intervals [a k (g\w- k --i), /3 k (g\w- k: -i)[ are represented in blue 
(g = 0), red (g = 1) and green (g = 2). For example, P(l|l) = ao(l|e) + 
ai(l|l) = 1/8 + 1/4, and P(0|00) = a (|e) + ai(0|0) + a 2 (0|00) = 1/4+1/8 + 0. 

Proposition [3] allows the following definition: 
Definition 1. The mapping <f> : [0, l[xG~ N — s- G is defined as follows: 

(f>{u,W_) = 9M ak (g),p k (g)[{u) ■ 

seG.fceN 

In words, for every u £ [0, 1[ and for every w £ G~ N , <fi(u,w) is the unique sym- 
bol g£ G such that there exists k £ N satisfying: u£ [a k (g\w— k: —i), (3 k (g\w- k: -i)[. 
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Proposition 4. is a coupling function satisfying Equation JT]): 

Vs e G*,W e [0, l],Vw,z e T(s), u<A| s |(s) =>• 4>{u,w) = <j>{u,z) . 

For every s£G* arid every u < A\ s \ (s) we denote in the sequel <fi(u, s) = 4>{u,w) 
for all w € T{s). 

Proof: We need to prove that if U ~ W([0, 1[), then for all w e G~ N * then 
random variable 4>{U,w) has distribution It is sufficient to prove that 

for all g £ G, 

oo 

^2Pk{g\w-i:-i) ~ a k (g\w-u-i) = P(g\w). 

1=0 

Actually, it holds that 

k k 

22Pk{g\w-h-i)-oik{g\w-i:-i) = } J ai(g\w-i : -i)-ai-i(g\w-i +1: - 1 ) = a k (g\w-i : -i). 

1=0 1=0 

As an increasing sequence upper-bounded by P(g\w), a k (g\w- k: -i) has a limit 
Q{g\w) < P{g\w) as k tends to infinity. By continuity, 

y^,Q(g\ui) = / ] hm a fe (g|w_ fe: _i) = lim } a k (g\vj- k :-i) = hm A k (w- k .-i) = 1, 

* — • * — ' k— >oo fc— >oo — ' fc->oo 
g£G g£G g&G 

and as J2g£G P(g\m) = 1 this implies that for all g S G, Q(g\w) = P(g\w). 
The last part of the proposition is immediate: forw, z € 7~(s),Vfc < |s|, u>_fc : _i = 
z_fe:_i and 

IJ [o<-k(g\w-k:-i),Pk(g\w- k :- 1 )[= [J [a fe (.g|z_ fc: _i),/3 fc (3|z_ fe: _i)[ = [0, A| s |(s)[ 

seG,fe<|s| g6G,/c<]s| 

Having defined the coupling function, we can now describe the simulation scheme 
outlined in Section |3l 

5 The Coupling from the Past Algorithm 
5.1 Complete suffix Dictionaries 

Let h e G* U G~ N *. If s e G* is such that \h\ > \s\ and h-\ s \..-i = s, we say 
that s is a suffix of s and we write h ^ s. 

A (finite or infinite) set of words D C V(G*) is called a complete suffix 
dictionary (CSD) if one of the following equivalent properties is satisfied: 

• every sequence w € G _N has a unique suffix in D: 

VwG G" N *,3!s efl:ajhs; 

• {7~{s) ■ s e D} is a partition of G _N : 

G- N * = u seD T(s) . 
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The CSD D can be represented by a trie, that is, a tree with edges labelled by 
elements of G such that the path from the root to any leaf is labelled by an 
element of D, see Figure O Accordingly, we define the depth of D as the depth 
of this trie: 

d(£>) = sup{|s| : s G D} 

The smallest possible CSD is {e} (its trie is reduced to the root): it has depth 
and size 1. The second smallest is G, it has depth 1. If D and D' are two 
CSD such that Vs <G D', s y D, then we note D' y D. This means that the trie 
representing D 1 entirely shadows that of D. 




i \ 
i \ 
i \ 
/ \ 
/ \ 



Figure 2: Left: the trie representing the Complete Suffix Dictionary D = 
{0,01,11}. Right: {00, 10, 001, 101, 11} h {0, 01, 11}. Both examples concern 
the binary alphabet. 

To every CSD D we associate a mapping, denoted by D, defined on G*UG _N 
as follows: 

• if there exists s G D such that ft ^ s, we note h y D and we define 
3(h) = {s}; 

• otherwise, we define D(h) = {s E D : s y h}. 

5.2 Piecewise constant functions and Complete Suffix Dic- 
tionaries 

Definition 2. For a CSD D, we say that a function f defined on G _N is 
.D-constant if 

Vs e Dy w g T{s)j(w) = f(0s) . 

Then, for every h G G~ N * U G* we define f(h) = f(T{h)) = f (p(h)\ and note 
that if h y D,f(h) is a singleton. 

Definition 3. A function f that is piecewise constant admits a minimal CSD 

, which is defined as the CSD with minimal cardinality such that f is constant 
on each of its elements. 

The minimal CSD D? is such that if s G D* , there exists w G G* such that 
s' = ws_\ s \ + i._i G D and f(s) ^ f(s'). If / is -D-constant, then D^ can be 
obtained by recursive pruning of D: as long as it is possible, prune the nodes 
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whoses children are leaves with the same value for /. Pruning is illustrated at 
the right side of Figure [3J 

We now turn back to the continuous kernel P and to its coupling function 
<j>. For each u £ [0, 1[, Proposition [2] shows that the mapping <j){u, •) is piecewise 
constant; we denote by D[u) = D^ u ^ = its minimal CSD. In the sequel, we 
show how the mappings iJ™ defined in Section [3J can be recursively constructed. 

5.3 Recursive construction of H™ 

Let t and n be negative integers. The mapping Zf™ being piecewise constant, 
we can define D™ = D Ht . 

First, observe that for t > n and for n < k < t, it holds that (H"(w)) k = 
Wk, and H™ cannot be constant: r(n) < n. Besides, for t > n it is easy 
to deduce H\ from . Thus, the coupling algorithm successively considers 
HZl,HZ%, ■ ■ • , if", H£_ 1 ,H%_2 ■ ■ • , and stops for the largest value of t < n such 
that .ff™ is constant. We propose the following recursive construction: 

• Initialization: let D x _ x = G and V.g £ G, Vw G T{s), Hht(w) = g. 

• For t < —1, let for each s £ D(Ut) denote {gt{s)} = <j>(Ut,s), and define 
E™{s) as follows: 

- if sg t {s) h D? +1 , let E?(a) = {s}; 

— otherwise, let 

E?(s) = |J {h} . 

hg t (s)£D? +1 (sg t (s)) 

• Let 

seD(u t ) 

E? is a CSD, and F t " is -constant. 

• D™ is obtained by pruning besides, for t = n, D\ is equal to -Dj +1 
unless D\ +1 = {e}, in which case D\ = G. 

This recursion is illustrated in Figure [3] 

Proof: Let us first check that E" is a CSD. Note that every h £ E™(s) is 
such that h >z s; let w £ T(s), and let {/ig t s } = D t+ i., n {wg s n ). Then ft, £ i?™(s) 
and w £ T(/i), so that T(s) = l\ heE -n^ s )T{h). The result follows, as G _N = 
u seD(u t )T(s). 

Now, we can prove that if™ is i?™-constant. We proceed by induction on 
i < 0. The case £ = — 1 is clear. For t < — 1, let h € -E", then by construction 
/i ^ D(Ut), and thus (j){U t ,h) is the singleton {<7t(s)} for s = D(U t )(h). Now, 
observe that hg t (s) > D™ +1 , thus H^(h) = H^ +1 {hg t (s)) , which is a singleton 
by the induction hypothese. 

Besides, for t = n, first assume that D\ +1 ^ {e}; then \h\ > 1 and Hj:(h) = 
h-iHl"+\{hg t (s)) is a singleton, so that H\ is Dj +1 -constant. Otherwise, if 
D\ +1 = {e}, is constant equal to, say, s' £ G t+1 , then Vw £ G" N *,V.g £ 

G,Ht(wjg) = gs', and thus = G. 

Having described the coupling algorithm, we now give some elements of analysis 
concerning its complexity. We first give sufficient conditions for the algorithm 
to terminate, and then we focus on the special but interesting case of bounded 
memory processes. 
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Figure 3: Obtaining D™ from D t and D™ +1 . For each function 4>(Ut, •), D™ +1 
and £>", we represent a CSD on which it is constant, and the values taken in 
each leaf; here, G = {0, 1}. 



6 Bounding the size of D™ 

6.1 Almost sure termination of the coupling scheme 

In general, the CSD D™ can be arbitrary large with positive probability. In 
[5], conditions are given that ensure the finitcness of r(n) defined above, from 
which bounds on £)" can be deduced. However, these conditions are quite 
restrictive: in particular, it is necessary that Aq(s) > 0. [7] somewhat release 
these conditions using an hybrid simulation scheme, allowing for Aq(e) = 0. 

A very crude bound (ignoring the 'Propp- Wilson' coalescence possibilities of 
the algorithm) in the following: denoting L" = d(-D"), an immediate inspection 
of the recurrence 15.31 leads to: 

L™ < ma,x{X t , L" +1 - 1} if t < n , and 
L\ < max{X t , L\\\ -1,1} if t > n . 

where the X t = d(D(Ut)) are i.i.d. random variables such that Vfc € N, V(X t < 
k) = Aj~. Thus, 

fe-t-i 

nL?<k)>P(L? +1 <k+l)A-> J] AJ- 
This bound permits to recover the results of [S], and one obtains: 

n oo 

E[z?]<£i-n^7- 

k=i j=k 
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6.2 The case of Finite Context Trees 



There is at least one case where it is easy to upper-bound the size of St indepen- 
dently of t < n: when the kernel P actually defines a Finite Context Tree, that is, 
when the mapping w ~^ P( - \w.) is piecewisc constant - then, denoting by D the 
minimal CSD of this mapping, it holds that Vw G G~ N * , P(-\D(w)) = {P(-\w}}. 
Even in that case, the simulation scheme described above is interesting: al- 
though the "plain" Propp- Wilson algorithm could be applied on the first order 
Markov chain {X t+ i, t+ d{D))tez on the extended state space G d ^ D \ the compu- 
tational complexity of such an algorithm might well become rapidly intractable 
if the depth d(D) is not very small, whereas the following property shows that 
our algorithm keeps a possibly much more limited complexity. Such qualities of 
parcimony are precisely the reason why finite context trees have proved success- 
ful in many applications, from Information Theory and universal coding (see 
[23 EH El OH) to biology ([211]) or linguistics [10]. 

Wc say that a CSD D is prefix-closed if every prefix of any sequence in D is 
the suffix of an element of D: 

Vs G D,Vk < \s\,3w G D :wt s_| s | : _ fc . 

A prefix-closed CSD satisfies the following property: 

Lemma 3. If D is a prefix-closed CSD, then for all h G D (or, equivalently, 
for all h y D) and for all a G G, ha y D. 

Proof: If h G G* is such that for some a G G, ha y D does not hold, then 
(as D is a CSD) there exists s G D and s' G G*\{e} such that s = s'ha. But 
then s'h is a prefix of s and, by the prefix-closure property, there exists w £ D 
such that w h s'h. Thus, one cannot have hy D. 

We define the prefix closure D of a CSD D as the minimal prefix-closed set 
containing D, that is, the set of maximal elements (for the partial order y) of 

5 = {«_ W: _ fc :«e D, k<\s\} . 

In other words, D is the smallest set such that for all w G D there exists s G D 
such that s y w. 

Proposition 5. For any complete suffix dictionary D, D is a complete suffix 



dictionary such that 



< \D\ x d(D). 



The majoration is trivial: \D~\ < \D\ < \D\xd(D). It is extremely pessimistic 
in general: many CSD are already prefix-closed, and for most CSD \D~\ is of the 
same order of magnitude as \D\. But in fact, for each positive integer n, there 
exists a CSD D of size n such that \D~\ > c\D\ 2 for some constant c w 0.4. 

Now, assume that D ^ {e}, i.e. that P is not memoryless. 

Proposition 6. For each t < n, Vk < t,D y D" . Thus, a fortiori, \D"\ < 
D~ <\D\x d(D). 

Proof: Wc show that D y D\ by induction on t. First, as P is not memo- 
ryless, D~ y DZ\ = G. Second, assume that D y D™ +1 : it is sufficient to prove 
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Figure 4: Graphical representation of the coupling function of P (blue is for 0, 
red is for 1) 

that if" (or if t > n) is 5)-constant. Observing that D >z D, for every 
U t € [0, 1[ and for every s € "E> it holds that <p(U tl s) is a singleton {gf}. Us- 
ing successively the lemma and the induction hypothese, sg^ >z 5 y D", thus 
H"(s) = H" +1 (sg t (s)) is also a singleton. Finally for t = n, Hi is ^-constant 
because D £ {e}. 

7 Examples 

We first present an example of a specification for which ao = and for which 
the convergence of the coupling coefficients is slow, so that neither the perfect 
simulation scheme of [5], nor its improvement by [7] can be applied; yet, a prob- 
abilistic upper-bound on r can be given proving that there exists a compatible 
stationary process. 

7.1 A continuous process with long memory 

For all k > 0, let 

P(0|01 fe ) = 1 - 1/Vk . 

The coupling coefficients of P are shown in Figure H] Observe that P(1|0) = 
limfc^oo P(0\0l k ) = 1, so that ao = 0. Besides, for k > it holds that Ak+i = 
A k {01 k ) = 1 - 1/Vk, so that 

n 

e n A k < °° 

n fc=2 

and the continuity conditions of [5J [7] do not apply. 

We will show that Algorithm XXX can be used to simulate samples of a 
process X have specification P (so that, in particular, such a process exists; 
uniqueness is straightforward). It is sufficient to show that the stopping time 
t(1) is almost surely finite. Actually, — t(1) is stochasticallly upper-bounded by 
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Figure 5: Convergence of the simulation scheme 



three times a geometric variable of parameter 3/(2\/2)— 1- To simplify notations, 
denote H t = Hf 1 . For every t < -2 if U t -\ < 1 - l/\/2, if f7* > 1 — l/\/2 and 
if Z7t+i < 1 — 1/ v2 then for every w G G~ N we have: 

• < 1 - l/y/2 implies = wlO and iJ t+ i(wl) = H t+2 {0); 

• U t > I - 1/V2 implies / t (w01) = wOll and H t {wQl) = H t+1 (wOU) = 
H t+2 (0) on the other hand, / t (w0) = wOl and H t (wQ) = H t+1 (w01) = 
ff t+2 (0); 

• /7t_i < 1 — l/\/2 implies /t-i(wl) = wlO and /t-i(tuO) = wOl, so that 
fft-iM) = F t _i(a;l) = H t+2 (0), and r > t - 1. 

For every negative integer fc let = {C/3fc_i < 1 — 1/V2} H {C^fc > 1 — 
l/v^jnlt/sfe+i < 1— 1/\/2}- The (i?fe)fc<o are independent events of probability 
3/(2y/2) — 1, which gives the result. 

Thus, the algorithm converges quickly. However, the dictionaries involved 
in the simulation can be very large: in fact, it is easy to see that the depth X t 
of D{Ut) has no expectation: ¥(X t > k) = \ j\fk. Of course, as D™ has a very 
special shape, ad hoc modifications of the algorithm would easily allow, here, 
to draw arbitrary long samples with low computational complexity. 

7.2 A non-continuous process 

Actually, a simple modification of this example show that continuity is ab- 
solutely not necessary to ensure convergence : the proof also applies to any 
kernel P' such that for < k < 2, P'(0|01 fc ) = 1 - 1/Vk, and for any 
w G G~ N ,F'(0|wll) > 1 — l/v2- Such a phenomenon has been studied 
in [S]: Gallo gives sufficient conditions on the shape of the trees, together 
with bounds on transition probabilities, that ensure convergence of his cou- 
pling scheme. However, his approach is quite different and does not cover the 
examples presented here. 



14 



Acknowledgments 



The authors thanks Sandro Gallo, Antonio Galves and Florencia Leonardi (Nu- 
mec, Sao Paulo) for stimulating and cnligthening discussions on chains of infinite 
memory. This work was supported by USP-COFECUB (grant 2009.1.820.45.8). 



References 

[1] G. Bcjcrano and G. Yona. Variations on probabilistic suffix trees: statistical 
modeling and prediction of protein families. Bioinformatics, 17(l):23-43, 
2001. 

Henry Berbee. Chains with infinite connections: uniqueness and Markov 
representation. Probab. Theory Related Fields, 76(2):243-253, 1987. 

P. Biihlmann and A. J. Wyncr. Variable length Markov chains. Ann. 
Statist, 27:480-513, 1999. 

J. R. Busch, P. A. Ferrari, A. G. Flesia, R. Fraiman, S. P. Grynberg, and 
F. Leonardi. Testing statistical hypothesis on random trees and applications 
to the protein classification problem. Annals of applied statistics, 3(2), 2009. 

Francis Comets, Roberto Fernandez, and Pablo A. Ferrari. Processes with 
long memory: regenerative construction and perfect simulation. Ann. Appl. 
Probab., 12(3):921-943, 2002. 

I. Csiszar and Z. Talata. Context tree estimation for not necessarily fi- 
nite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory, 
52(3):1007-1016, 2006. 

Mauro Piccioni Emilio De Santis. A general framework for perfect simula- 
tion of long memory processes, April 2010. 

S. G. Foss, R. L. Tweedie, and J. N. Corcoran. Simulating the invariant 
measures of Markov chains using backward coupling at regeneration times. 
Probab. Engrg. Inform. Sci., 12(3):303-320, 1998. 

Sandro Gallo. Chains with unbounded variable length memory: perfect 
simulation and visible regeneration scheme, 2010. 

A. Galves, C. Galves, J. Garcia, N.L. Garcia, and F. Leonardi. Context 
tree selection and linguistic rhythm retrieval from written texts. ArXiv: 
0902.3619, pages 1-25, 2010. 

A. Garivier. Consistency of the unlimited BIC context tree estimator. IEEE 
Trans. Inform. Theory, 52(10) :4630-4635, 2006. 

Olio Haggstrom. Finite Markov chains and algorithmic applications, vol- 
ume 52 of London Mathematical Society Student Texts. Cambridge Univer- 
sity Press, Cambridge, 2002. 

Theodore E. Harris. On chains of infinite order. Pacific J. Math., 5:707- 
724, 1955. 



15 



[14] S. P. Lalley. Regenerative representation for one-dimensional Gibbs states. 
Ann. Probab., 14(4): 1262-1271, 1986. 

[15] S. P. Lalley. Regeneration in one-dimensional Gibbs states and chains with 
complete connections. Resenhas, 4(3):249-281, 2000. 

[16] D. J. Murdoch and P. J. Green. Exact sampling from a continuous state 
space. Scand. J. Statist, 25(3):483-502, 1998. 

[17] Octav Onicescu and Gheorghe Mihoc. Sur les chaines de variables statis- 
tiques. Bull. Sci. Math., 59:174-192, 1935. 

[18] Octav Onicescu and Gheorghe Mihoc. Sur les chaines statistiques. Comptes 
Rendus de I'Academie des Sciences de Paris, 200:511-512, 1935. 

[19] James Gary Propp and David Bruce Wilson. Exact sampling with coupled 
Markov chains and applications to statistical mechanics. In Proceedings 
of the Seventh International Conference on Random Structures and Algo- 
rithms (Atlanta, GA, 1995), volume 9, pages 223-252, 1996. 

[20] J. Rissanen. A universal data compression system. IEEE Trans. Inform. 
Theory, 29(5):656-664, 1983. 

[21] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context-tree 
weighting method: Basic properties. IEEE Trans. Inf. Theory, 41(3):653- 
664, 1995. 



16 



